Open Bug 1733908 Opened 4 years ago Updated 2 years ago

[Clang/TSAN bug, not wasm] Clang13 TSAN failure in wasm tiering

Tracking

(Not tracked)

Status:

REOPENED

People

(Reporter: lth, Unassigned)

References

(Blocks 1 open bug)

Details

(Keywords: leave-open)

Attachments

(2 files, 1 obsolete file)

WIP: Bug 1733908 - Strengthen invariant on Code::tier2_ 4 years ago Lars T Hansen [:lth] 48 bytes, text/x-phabricator-request		Details \| Review
Bug 1733908 - Add temporary TSan suppressions 4 years ago Kris Wright :KrisWright 48 bytes, text/x-phabricator-request		Details \| Review
Bug 1733908 - Add one more js::wasm::Code suppression to TSan 4 years ago Kris Wright :KrisWright 48 bytes, text/x-phabricator-request		Details \| Review

Lars T Hansen [:lth]

Reporter

Description

•

4 years ago

•

Edited

Clang13 TSAN reports a race condition in our tiering code, see https://firefox-ci-tc.services.mozilla.com/tasks/B4sU3VtNR2264ZjrYSqWTA/runs/0/logs/public/logs/live.log#L26976. The information so far:

The stack traces of the two accesses point to a read from Code::bestTier() on the main thread and a write from UniquePtr::reset() called from Code::setTier2() on a background thread.

The offending variable appears to be the unsynchronized UniquePtr tier2_ in the Code object: offset 16 bytes from the start of a Code object malloc'd by the main thread in ModuleGenerator::finishModule.

Access to tier2_ is controlled by the Atomic<bool> hasTier2_ variable. While hasTier2_ is false, tier2_ shall not be read; once hasTier2_ is set, tier2_ shall not be written (except for destruction). hasTier2_ is sequentially consistent and writing it publishes the value of tier2_ to other threads; all bits of tier2_ written by or safely seen by the thread that sets hasTier2_ become visible to all other threads that subsequently read hasTier2_.

The tier2_ variable is written by UniquePtr::reset, called from setTier2(). Before this, setTier2() checks that hasTier2_ is false.

The tier2_ variable is read by bestTier(), but only after checking that hasTier2_ is true. It is also read by some other functions in WasmCode.cpp and they all check that hasTier2_ is true before they access tier2_, using either 'if (hasTier2()) ...' or 'hasTier2() && ...'. [There is an exception, namely codeTier(t), but investigation shows that this is not an issue, and it can be strengthened to the point where it is no longer an exception.]

The hasTier2_ variable transitions from false to true only as a result of a call to commitTier2(). There is a single call to commitTier2() in the code, and it textually follows the call to setTier2(), and both calls are from within Module::finishTier2(). (FWIW, neither setTier2() or commitTier2() appear to have been inlined into finishTier2().)

The only way this race could happen is if hasTier2_ becomes set after it is checked in setTier2() so a concurrent thread passes the tier 2 test in bestTier() and reads tier2_ around the same time as setTier2() is writing to tier2_.

All the failures (there are 193 reported in this particular run) fail in the same way. And all fail with --test-wasm-await-tier2, which is sensible - most of the time, this is the only way we get to test tiering on CI.

Various hypotheses:

This is a huge number of race conditions for a test run and makes me suspicious that the analysis is not actually correct.
We could be starting multiple background compilations, possibly as a result of the recent rejiggering of the helper threads system, this would explain th race for sure.
It could also be a bug not in tiering but in our testing of tiering.

(There is a second bug here. In Code::commitTier2 there is an ordering bug: a MOZ_RELEASE_ASSERT reads tier2 data to check that it is indeed in place before setting the hasTier2_ variable. This is wrong because it is the writing of hasTier2_ that signals that it is ok to read tier2 data. The race is benign in the sense that it is the present thread that has created the tier2 data, and thus inconsistent data will never be read here. And it does not seem like this is the bug causing our current problems.)

Lars T Hansen [:lth]

Reporter

Updated

•

4 years ago

Summary: Clang13 TSAN failure in Code::commitTier2 → Clang13 TSAN failure in wasm tiering

Lars T Hansen [:lth]

Reporter

Comment 1

•

4 years ago

I think the second bug above, starting multiple background compilations for the same module, is very unlikely. commitTier2 checks that tier2 has not already been committed with a release assert, and if we had multiple tier-up compilations going we would have crashed here a long time ago.

Lars T Hansen [:lth]

Reporter

Comment 2

•

4 years ago

Attached file WIP: Bug 1733908 - Strengthen invariant on Code::tier2_ (obsolete) — Details

It is an invariant of Code::tier2_ that it is never read without the
atomic Code::hasTier2_ being true. This invariant was violated in a
couple of harmless cases, where code on the thread that had just
written tier2_ (the committing background thread) would read tier2_
again before the tier-2 code was committed by hasTier2_ being written.

TODO: I added a protocol here for setTier2 to return a borrowed
pointer to the tier-2 code. But it's possible that the caller of
setTier2 should just extract that pointer from values it already has
in hand. Discuss.

Lars T Hansen [:lth]

Reporter

Comment 3

•

4 years ago

I think we need to extract the machine code for Code::bestTier from the executable. We need to verify that there is no illegal reordering going on. It would be possible for the compiler to hoist the read of the tier2_ field above the guard without risking a crash, but only if it assumes that hasTier2() does not create an ordering relationship in the accesses. (That would be a compiler bug and thus is unlikely, but it would be good to rule it out.)

Lars T Hansen [:lth]

Reporter

Comment 4

•

4 years ago

Disassembly below from the build artifact. Argument registers, in order: RDI, RSI, RDX, RCX, R8, R9. Nonvolatile registers: RBP, RBX, R12, R13, R14, R15. Code::bestTier() takes only a single argument, namely 'this', which is passed in RDI. It selects a CodeTier t and then calls t->tier(), which first reads t->segment_ and then does stuff with that.

The disassembly shows that while the actual reading of tier2_ is guarded by a branch on the atomic value, the TSAN flagging of the read of tier2_ has been hoisted to before the conditional branch and is therefore incorrect, ie this is a TSAN/Clang bug. Andi, can you please move this bug to the appropriate Bugzilla component? I'll look into cleaning up the attached patch and landing it, but likely under a different bug.

(gdb) disassemble Code::bestTier
push   %rbp
mov    %rsp,%rbp
push   %r15             ;; save
push   %r14             ;;  callee-
push   %r12             ;;   saves
push   %rbx             ;;    registers
mov    %rdi,%rbx        ;; RBX = this
mov    0x8(%rbp),%rdi   ;; RDI = <return address>
call   0x21a640 <__tsan_func_entry()>    ;; flag fn entry
lea    0x18(%rbx),%rdi  ;; RDI = &this->hasTier2_
mov    $0x5,%esi        ;; RSI = magic argument
call   0x1fdb40 <__tsan_atomic32_load()> ;; signal the atomic load
mov    %eax,%r14d       ;; R14 = <result of atomic load>
mov    0x18(%rbx),%eax  ;; (void)atomic_load(this->hasTier2)
lea    0x10(%rbx),%r15  ;; R15 = &this->tier2_
lea    0x8(%rbx),%rdi   ;; RDI = &this->tier1_
call   0x2141c0 <__tsan_read8()>         ;; flag the read of tier1_
mov    0x8(%rbx),%r12   ;; R12 = this->tier1_
mov    %r15,%rdi        ;; RDI = &this->tier2_
call   0x2141c0 <__tsan_read8()>         ;; flag the read
test   %r14d,%r14d      ;; test hasTier2_
je     0x12389ed L1     ;; and jump if false
mov    0x10(%rbx),%r12  ;; R12 = this->tier2_
L1:
lea    0x10(%r12),%rdi  ;; RDI = &theTier->segment_
call   0x2141c0 <__tsan_read8()>         ;; flag the read
mov    0x10(%r12),%rbx  ;; RBX = theTier->segment_
lea    0x24(%rbx),%rdi  ;;  ... and so on
call   0x213a00 <__tsan_read4()>
mov    0x24(%rbx),%ebx
call   0x21a6d0 <__tsan_func_exit()>
mov    %ebx,%eax
pop    %rbx
pop    %r12
pop    %r14
pop    %r15
pop    %rbp
ret

Assignee: lhansen → nobody

Status: ASSIGNED → NEW

Flags: needinfo?(bpostelnicu)

Priority: P2 → P3

Summary: Clang13 TSAN failure in wasm tiering → [Clang/TSAN bug, not wasm] Clang13 TSAN failure in wasm tiering

Lars T Hansen [:lth]

Reporter

Comment 5

•

4 years ago

(In fairness, it doesn't have to be a compiler bug, it could be a bug in how the atomicity of the operations in mfbt/Atomics.h is made visible to the compiler. But given that the TSAN call is hoisted above the condition, it looks more like a compiler bug than like anything else.)

Lars T Hansen [:lth]

Reporter

Updated

•

4 years ago

Blocks: 1733970

Phabricator Automation

Comment 6

•

4 years ago

Comment on attachment 9244170 [details]
WIP: Bug 1733908 - Strengthen invariant on Code::tier2_

Revision D127421 was moved to bug 1733970. Setting attachment 9244170 [details] to obsolete.

Attachment #9244170 - Attachment is obsolete: true

Lars T Hansen [:lth]

Reporter

Comment 7

•

4 years ago

Not a wasm bug.

Status: NEW → RESOLVED

Closed: 4 years ago

Resolution: --- → INVALID

Andi [:andi]

Updated

•

4 years ago

Blocks: clang-13
No longer blocks: 1733970

Status: RESOLVED → REOPENED

Component: Javascript: WebAssembly → General

Flags: needinfo?(bpostelnicu)

Product: Core → Firefox Build System

Resolution: INVALID → ---

BugBot [:suhaib / :marco/ :calixte]

Comment 8

•

4 years ago

The product::component has been changed since the backlog priority was decided, so we're resetting it.
For more information, please visit auto_nag documentation.

Priority: P3 → --

Mitchell Hentges [:mhentges] 🦀

Updated

•

4 years ago

Priority: -- → P3

Kris Wright :KrisWright

Comment 9

•

4 years ago

Attached file Bug 1733908 - Add temporary TSan suppressions — Details

Looks like there are a couple of impacted sites - these temporary suppressions should clear up the tests until they can be addressed.

Phabricator Automation

Updated

•

4 years ago

Assignee: nobody → kwright

Pulsebot

Comment 10

•

4 years ago

Pushed by kwright@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/871c6687088a Add temporary TSan suppressions r=decoder

Kris Wright :KrisWright

Updated

•

4 years ago

Keywords: leave-open

Kris Wright :KrisWright

Comment 11

•

4 years ago

Attached file Bug 1733908 - Add one more js::wasm::Code suppression to TSan — Details

Try pushes are showing one more data race that I didn't catch.

Noemi Erli[:noemi_erli]

Comment 12

•

4 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/871c6687088a

Pulsebot

Comment 13

•

4 years ago

Pushed by kwright@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/e1333c140677 Add one more js::wasm::Code suppression to TSan r=decoder

Natalia Csoregi [:nataliaCs]

Comment 14

•

4 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/e1333c140677

Kris Wright :KrisWright

Updated

•

4 years ago

Assignee: kwright → nobody

Kris Wright :KrisWright

Updated

•

4 years ago

Blocks: tsan

Arthur K. (he/him)

Comment 15

•

4 years ago

Anything that the upcoming Clang 13.0.1 and WASI-SDK 14 can remedy here?

BugBot [:suhaib / :marco/ :calixte]

Comment 16

•

3 years ago

The leave-open keyword is there and there is no activity for 6 months.
:ahochheiden, maybe it's time to close this bug?
For more information, please visit auto_nag documentation.

Flags: needinfo?(ahochheiden)

Alex Hochheiden [:ahochheiden]

Comment 17

•

3 years ago

I'm assuming this still needs to stay open, but I'm not sure.

:Decoder, you own the parent tsan bug, what do you think?

Flags: needinfo?(ahochheiden) → needinfo?(choller)

BugBot [:suhaib / :marco/ :calixte]

Comment 18

•

2 years ago

The leave-open keyword is there and there is no activity for 6 months.
:ahochheiden, maybe it's time to close this bug?
For more information, please visit auto_nag documentation.

Flags: needinfo?(ahochheiden)

Alex Hochheiden [:ahochheiden]

Comment 19

•

2 years ago

Same as before.

Flags: needinfo?(ahochheiden)

Christian Holler (:decoder)

Comment 20

•

2 years ago

As long as we have a suppression for this in mozglue/build/TsanOptions.cpp, this bug needs to stay open (we could check if we still need it with Clang 15).

Flags: needinfo?(choller)

You need to log in before you can comment on or make changes to this bug.